-
Notifications
You must be signed in to change notification settings - Fork 535
feat(datafusion): add max_temp_directory_size parameter for z-order and compact operations for DataFusion #3847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3847 +/- ##
==========================================
- Coverage 73.72% 73.63% -0.09%
==========================================
Files 151 151
Lines 39250 39272 +22
Branches 39250 39272 +22
==========================================
- Hits 28936 28919 -17
- Misses 9005 9045 +40
+ Partials 1309 1308 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
preserve_insertion_order: false, | ||
max_concurrent_tasks: num_cpus::get(), | ||
max_spill_size: 20 * 1024 * 1024 * 1024, // 20 GB. | ||
max_temp_directory_size: 100 * 1024 * 1024 * 1024, // 100 GB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not expose it here if we immediately deprecate it, but rather create a proper configured Session in the python binding when calling optimize and passing that around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the simple approach.
I can take more time to create a properly configured session, something like this:
fn create_optimize_session(
max_spill_size: usize,
max_temp_directory_size: u64,
) -> SessionState {
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okey dokey!
I will refactor this to remove depreciation and configure a proper session. I will ping you when it's ready.
c82d4ba
to
332179c
Compare
I looked into this similarly and unfortunately, there doesn't seem to be a way to allow passing in a datafusion SessionState from the python side to the rust side, which would give python callers the maximum flexibility :/ |
Not a state, but you can consume FFISessionConfig, not sure if it has all the some buttons and knobs as state/context |
It accomplishes the goal on the per-operation level, but if you wanted to share a SessionContext across multiple operations (say, to limit the total amount of ram for working with multiple tables concurrently) you're out of luck. I guess the rust side could store a SessionContext as an optional in RawDeltaTable and pass that in every operation to achieve the same goal, though. Definitely not done in this ticket, though :D |
b0d47cf
to
79dcc10
Compare
Something like:
We can have a follow-up PR for this, WDYT? |
Yeah might be nice for interopability with datafusion-python |
@fvaleye @ion-elgreco @abhiaagarwal - I just opened #3860 to create a more narrow waist when creating datafusion sessions. Going forward we also need to wire our object-store and log store factories with the datafusion session. Would it be possible to base this PR on the one above to properly configure the session? |
Yes, I will do that @roeap 👍 |
…nContext Signed-off-by: Florian Valeye <[email protected]>
2fc4046
to
6652cef
Compare
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
@fvaleye can you resolve the conflict, then we are good to go :) |
16e6111
to
6652cef
Compare
I'm on it! |
4648d1b
to
5c83b44
Compare
Description
max_temp_directory_size
parameter to the Python bindings to optimize DataFusion operations.max_spill_size
to the Python binding (available in Rust)Related Issue(s)
max_temp_directory_size
in the disk manager configuration when optimizing large table #3833